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ABSTRACT 

We present further development and the first pubUc release of our multimodal nested sam- 
pling algorithm, called MultiNest. This Bayesian inference tool calculates the evidence, 
with an associated error estimate, and produces posterior samples from distributions that 
may contain multiple modes and pronounced (curving) degeneracies in high dimensions. 
The developments presented here lead to further substantial improvements in sampling ef- 
ficiency and robustness, as compared to the original algorithm presented in Feroz & Hob- 
son (2008), which itself significantly outperformed existing MCMC techniques in a wide 
range of astrophysical inference problems. The accuracy and economy of the MultiNest 
algorithm is demonstrated by application to two toy problems and to a cosmological in- 
ference problem focussing on the extension of the vanilla ACDM model to include spa- 
tial curvature and a varying equation of state for dark energy. The MultiNest software, 
which is fully parallelized using MPI and includes an interface to CosmoMC, is available at 
http : / / www . mrao . cam . ac . uk/ software/mult inest/. It will also be released 
as part of the SuperBayeS package, for the analysis of supersymmetric theories of particle 
physics, at http : / / www . superbayes . org 

Key words: methods: data analysis - methods: statistical 



1 INTRODUCTION 

Bayesian analysis methods are already widely used in astrophysics 
and cosmology, and are now beginning to gain acceptance in parti- 
cle physics phenomenology. As a consequence, considerable effort 
has been made to develop efficient and robust methods for perform- 
ing such analyses. Bayesian inference is usually considered to di- 
vide into two categories: parameter estimation and model selection. 
Parameter estimation is typically performed using Markov chain 
Monte Carlo (MCMC) sampling, most often based on the standard 
Metropolis-Hastings algorithm or its variants, such as Gibbs' or 
Hamiltonian sampling (see e.g. Mackay 2003). Such methods can 
be very computationally intensive, however, and often experience 
problems in sampling efficiently from a multimodal posterior dis- 
tribution or one with large (curving) degeneracies between param- 
eters, particularly in high dimensions. Moreover, MCMC methods 
often require careful tuning of the proposal distribution to sam- 
ple efficiently, and testing for convergence can be problematic. 
Bayesian model selection has been further hindered by the even 
greater computational expense involved in the calculation to suffi- 
cient precision of the key ingredient, the Bayesian evidence (also 
called the marginalized likelihood or the marginal density of the 
data). As the average likelihood of a model over its prior probabil- 
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ity space, the evidence can be used to assign relative probabilities 
to different models (for a review of cosmological applications, see 
Mukherjee et al. 2006). The existing preferred evidence evaluation 
method, again based on MCMC techniques, is thermodynamic in- 
tegration (see e.g. O Ruanaidh & Fitzgerald 1996), which is ex- 
tremely computationally intensive but has been used successfully 
in astronomical applications (see e.g. Hobson & McLachlan 2003; 
Marshall et al. 2003; Slosar A. et al. 2003; Niarchou et al. 2004; 
Bassett et al. 2004; Trotta 2007; Beltran et al. 2005; Bridges et al. 
2006). Some fast approximate methods have been used for evidence 
evaluation, such as treating the posterior as a multivariate Gaussian 
centred at its peak (see e.g. Hobson et al. 2002), but this approxima- 
tion is clearly a poor one for multimodal posteriors (except perhaps 
if one performs a separate Gaussian approximation at each mode). 
The Savage-Dickey density ratio has also been proposed (Trotta 
2005) as an exact, and potentially faster, means of evaluating ev- 
idences, but is restricted to the special case of nested hypotheses 
and a separable prior on the model parameters. Various alternative 
information criteria for astrophysical model selection are discussed 
by Liddle (2007), but the evidence remains the preferred method. 

Nested sampling (Skilling 2004) is a Monte Carlo method tar- 
getted at the efficient calculation of the evidence, but also produces 
posterior inferences as a by-product. In cosmological applications, 
Mukherjee et al. (2006) showed that their implementation of the 
method requires a factor of ~ 100 fewer posterior evaluations 
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than thermodynamic integration. To achieve an improved accep- 
tance ratio and efficiency, their algorithm uses an elhptical bound 
containing the current point set at each stage of the process to re- 
strict the region around the posterior peak from which new samples 
are drawn. Shaw et al. (2007) point out that this method becomes 
highly inefficient for multimodal posteriors, and hence introduce 
the notion of clustered nested sampling, in which multiple peaks 
in the posterior are detected and isolated, and separate ellipsoidal 
bounds are constructed around each mode. This approach signifi- 
cantly increases the sampling efficiency. The overall computational 
load is reduced still further by the use of an improved error cal- 
culation (Skilling 2004) on the final evidence result that produces 
a mean and standard error in one sampling, eliminating the need 
for multiple runs. In our previous paper (Feroz & Hobson 2008 - 
hereinafter FH08), we built on the work of Shaw et al. (2007) by 
pursuing further the notion of detecting and characterising multi- 
ple modes in the posterior from the distribution of nested samples, 
and presented a number of innovations that resulted in a substantial 
improvement in sampling efficiency and robustness, leading to an 
algorithm that constituted a viable, general replacement for tradi- 
tional MCMC sampling techniques in astronomical data analysis. 

In this paper, we present further substantial development of 
the method discussed in FH08 and make the first public release of 
the resulting Bayesian inference tool, called MultiNest. In par- 
ticular, we propose fundamental changes to the 'simultaneous ellip- 
soidal sampling' method described in FH08, which result in a sub- 
stantially improved and fully parallelized algorithm for calculat- 
ing the evidence and obtaining posterior samples from distributions 
with (an unkown number of) multiple modes and/or pronounced 
(curving) degeneracies between parameters. The algorithm also 
naturally identifies individual modes of a distribution, allowing for 
the evaluation of the 'local' evidence and parameter constraints as- 
sociated with each mode separately. 

The outline of the paper is as follows. In Section 2, we briefly 
review the basic aspects of Bayesian inference for parameter es- 
timation and model selection. In Section 3, we introduce nested 
sampling and discuss the use of ellipsoidal bounds in Section 4. 
In Section 5, we present the MultiNest algorithm. In Section 6, 
we apply our new algorithms to two toy problems to demonstrate 
the accuracy and efficiency of the evidence calculation and param- 
eter estimation as compared with other techniques. In Section 7, 
we consider the use of our new algorithm for cosmological model 
selection focussed on the extension of the vanilla ACDM model to 
include spatial curvature and a varying equation of state for dark 
energy. We compare the efficiency of MtJLTlNEST and standard 
MCMC techniques for cosmological parameter estimation in Sec- 
tion 7.3. Finally, our conclusions are presented in Section 8. 



2 BAYESIAN INFERENCE 

Bayesian inference methods provide a consistent approach to the 
estimation of a set parameters © in a model (or hypothesis) H for 
the data D. Bayes' theorem states that 

where Pr(0|D,i?) = Vi®) is the posterior probability distri- 
bution of the parameters, Pr(D|0, H) = C{@) is the likelihood, 
Pr(0|iy) = 7r(©) istheprior, andPr(D|//) = 2: is the Bayesian 
evidence. 

In parameter estimation, the normalising evidence factor is 



usually ignored, since it is independent of the parameters 0, and 
inferences are obtained by taking samples from the (unnormalised) 
posterior using standard MCMC sampling methods, where at equi- 
Ubrium the chain contains a set of samples from the parameter 
space distributed according to the posterior. This posterior consti- 
tutes the complete Bayesian inference of the parameter values, and 
can be marginalised over each parameter to obtain individual pa- 
rameter constraints. 

In contrast to parameter estimation problems, in model selec- 
tion the evidence takes the central role and is simply the factor re- 
quired to normalize the posterior over 0: 

Z = j £(0)7r(0)d°0, (2) 

where D is the dimensionality of the parameter space. As the av- 
erage of the likelihood over the prior, the evidence automatically 
implements Occam's razor: a simpler theory with compact param- 
eter space will have a larger evidence than a more complicated one, 
vmless the latter is significantly better at explaining the data. The 
question of model selection between two models Ho and H\ can 
then be decided by comparing their respective posterior probabili- 
ties given the observed data set D, as follows 

Pr(Hi|D) _ Pr(D|Hi)Pr(gi) _ Zi Pr(Hi) 
Pr(H„|D) Pr(D|i?„)Pr(i/(,) 2„ Pr(H„) ' ^' 

where Pr(iyi)/ Pr(_H'o) is the a priori probability ratio for the two 
models, which can often be set to unity but occasionally requires 
further consideration. 

Evaluation of the multidimensional integral (2) is a challeng- 
ing numerical task. The standard technique of thermodynamic in- 
tegration draws MCMC samples not from the posterior directly but 
from C^-K where A is an inverse temperature that is slowly raised 
from w to 1 according to some annealing schedule. It is pos- 
sible to obtain accuracies of within 0.5 units in log-evidence via 
this method, but in cosmological model selection applications it 
typically requires of order 10^ samples per chain (with around 10 
chains required to determine a sampUng error). This makes evi- 
dence evaluation at least an order of magnitude more costly than 
parameter estimation. 



3 NESTED SAMPLING 

Nested sampling (Skilling 2004) is a Monte Carlo technique aimed 
at efficient evaluation of the Bayesian evidence, but also pro- 
duces posterior inferences as a by-product. A full discussion of the 
method is given in FH08, so we give only a briefly description here, 
following the notation of FH08. 

Nested sampling exploits the relation between the likelihood 
and prior volume to transform the multidimensional evidence inte- 
gral (Eq. 2) into a one-dimensional integral. The 'prior volume' X 
is defined by dX = 7r(0)d°0, so that 

X(A) = / 7r(0)d^0, (4) 

Jc(&)>x 

where the integral extends over the region(s) of parameter space 
contained within the iso-likelihood contour £■{&) = X. The evi- 
dence integral (Eq. 2) can then be written as 

Z= f C{X)dX, (5) 
Jo 

where jC{X), the inverse of Eq. 4, is a monotonically decreasing 



© 2008 RAS, MNRAS 000, 1-14 



MultiNest; efficient and robust Bayesian inference 3 


















z 

Q Xa 







(a) 



(b) 



Figure 1. Cartoon illustrating (a) the posterior of a two dimensional prob- 
lem; and (b) the transformed C{X) function where the prior volumes Xi 
are associated with each likelihood Ci. 



function of X. Thus, if one can evaluate the likelihoods d = 
C{Xi), where Xi is a sequence of decreasing values, 

< Xm < • • • < X2 < Xi < Xo = 1, (6) 

as shown schematically in Fig. 1, the evidence can be approximated 
numerically using standard quadrature methods as a weighted sum 



(7) 



In the following we will use the simple trapezium rule, for which 
the weights are given by Wi = i(Xi_i — Xi+i). An example of 
a posterior in two dimensions and its associated function C{X) is 
shown in Fig. 1. 

The summation (Eq. 7) is performed as follows. The itera- 
tion counter is first set to i = and N 'active' (or 'live') sam- 
ples are drawn from the full prior 7r(0) (which is often simply 
the uniform distribution over the prior range), so the initial prior 
volume is Xo = 1. The samples are then sorted in order of their 
likelihood and the smallest (with likelihood jCo) is removed from 
the active set (hence becoming 'inactive') and replaced by a point 
drawn from the prior subject to the constraint that the point has 
a likelihood C > £0. The corresponding prior volume contained 
within this iso-likelihood contour will be a random variable given 
by Xi = tiXo, where h follows the distribution Pr(t) = iVf^"^ 
(i.e. the probability distribution for the largest of N samples drawn 
uniformly from the interval [0, 1]). At each subsequent iteration i, 
the removal of the lowest likelihood point d in the active set, the 
drawing of a replacement with C > Ci and the reduction of the 
corresponding prior volume Xi = UXi-i are repeated, until the 
entire prior volume has been traversed. The algorithm thus travels 
through nested shells of likelihood as the prior volume is reduced. 
The mean and standard deviation of log t, which dominates the ge- 
ometrical exploration, are E[\ogt] = —1/N and cr [log = 1/N. 
Since each value of log t is independent, after i iterations the prior 
volume will shrink down such that log Xj w — (i ± /N. Thus, 
one takes Xi = exp{—i/N). 

The algorithm is terminated on determining the evidence to 
some specified precision (we use 0.5 in log-evidence): at iteration 
i, the largest evidence contribution that can be made by the remain- 
ing portion of the posterior is AZi = £,naxXi, where £max is 
the maximum likelihood in the current set of active points. The 
evidence estimate (Eq. 7) may then be refined by adding a final 
increment from the set of A'^ active points, which is given by 



where wm+j — Xm /N for all j. The final uncertainty on the cal- 
culated evidence may be straightforwardly estimated from a single 
run of the nested sampling algorithm by calculating the relative en- 
tropy of the full sequence of samples (see FH08). 

Once the evidence Z is found, posterior inferences can be eas- 
ily generated using the full sequence of (inactive and active) points 
generated in the nested sampling process. Each such point is simply 
assigned the weight 



Pj 



CjWj 



(9) 



AZ = '^CjWM+j, 



(8) 



where the sample index j runs from I to J\f — AI + N, the total 
number of sampled points. These samples can then be used to cal- 
culate inferences of posterior parameters such as means, standard 
deviations, covariances and so on, or to construct marginalised pos- 
terior distributions. 



4 ELLIPSOIDAL NESTED SAMPLING 

The most challenging task in implementing the nested sampling 
algorithm is drawing samples from the prior within the hard con- 
straint C > d at each iteration i. Employing a naive approach that 
draws blindly from the prior would result in a steady decrease in 
the acceptance rate of new samples with decreasing prior volume 
(and increasing likelihood). 

Ellipsoidal nested sampling (Mukherjee et al. 2006) tries to 
overcome the above problem by approximating the iso-likelihood 
contour £ — £i by a _D-dimensional ellipsoid determined from the 
CO variance matrix of the current set of active points. New points 
are then selected from the prior within this elUpsoidal bound (usu- 
ally enlarged slightly by some user-defined factor) until one is ob- 
tained that has a likelihood exceeding that of the removed lowest- 
likelihood point. In the limit that the ellipsoid coincides with the 
true iso-likelihood contour, the acceptance rate tends to unity. 

Ellipsoidal nested sampling as described above is efficient for 
simple unimodal posterior distributions without pronounced degen- 
eracies, but is not well suited to multimodal distributions. As advo- 
cated by Shaw et al. (2007) and shown in Fig. 2, the sampling ef- 
ficiency can be substantially improved by identifying distinct clus- 
ters of active points that are well separated and constructing an in- 
dividual (enlarged) ellipsoid bound for each cluster. In some prob- 
lems, however, some modes of the posterior may exhibit a pro- 
nounced curving degeneracy so that it more closely resembles a 
(multi-dimensional) 'banana'. Such features are problematic for all 
sampling methods, including that of Shaw et al. (2007). 

In FH08, we made several improvements to the sampling 
method of Shaw et al. (2007), which significantly improved its effi- 
ciency and robustness. Among these, we proposed a solution to the 
above problem by partitioning the set of active points into as many 
sub-clusters as possible to allow maximum flexibility in following 
the degeneracy. These clusters are then enclosed in ellipsoids and 
a new point is then drawn from the set of these 'overlapping' el- 
lipsoids, correctly taking into account the overlaps. Although this 
sub-clustering approach provides maximum efficiency for highly 
degenerate distributions, it can result in lower efficiencies for rel- 
atively simpler problems owing to the overlap between the ellip- 
soids. Also, the factor by which each ellipsoid was enlarged was 
chosen arbitrarily. Another problem with the our previous approach 
was in separating modes with elongated curving degeneracies. We 
now propose solutions to all these problems, along with some addi- 
tional modifications to improve efficiency and robustness still fur- 
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(a) 



(b) 



(c) 



(d) 



(e) 



Figure 2. Cartoon of ellipsoidal nested sampling from a simple bimodal distribution. In (a) we see that the ellipsoid represents a good bound to the active 
region. In (b)-(d), as we nest inward we can see that the acceptance rate will rapidly decrease as the bound steadily worsens. Figure (e) illustrates the increase 
in efficiency obtained by sampUng from each clustered region separately. 



ther, in the MultiNest algorithm presented in the following sec- 
tion. 



5 THE MULTINEST ALGORITHM 

The MultiNest algorithm builds upon the 'simultaneous ellip- 
soidal nested sampling method' presented in FH08, but incorpo- 
rates a number of improvements. In short, at each iteration i of the 
nested sampling process, the full set of A'^ active points is parti- 
tioned and ellipsoidal bounds constructed using a new algorithm 
presented in Section 5.2 below. This new algorithm is far more ef- 
ficient and robust than the method used in FH08 and automatically 
accommodates elongated curving degeneracies, while maintaining 
high efficiency for simpler problems. This results in a set of (pos- 
sibly overlapping) ellipsoids. The lowest-likelihood point from the 
full set of A'^ active points is then removed (hence becoming 'inac- 
tive') and replaced by a new point drawn from the set of ellipsoids, 
correctly taking into account any overlaps. Once a point becomes 
inactive it plays no further part in the nested sampling process, but 
its details remain stored. We now discuss the MtJLTlNEST algo- 
rithm in detail. 



5.1 Unit hypercube sampling space 

The new algorithm for partitioning the active points into clusters 
and constructing ellipsoidal bounds requires the points to be uni- 
formly distributed in the parameter space. To satisfy this require- 
ment, the MtJLTlNEST 'native' space is taken as a D-dimensional 
unit hypercube (each parameter value varies from to 1) in which 
samples are drawn uniformly. All partitioning of points into clus- 
ters, construction of ellipsoidal bounds and sampling are performed 
in the unit hypercube. 

In order to conserve probability mass, the point u = 
(u\,U2, ■ ■ ■ ,ud) in the unit hypercube should be transformed to 
the point = {9i, 62, • • • , &d) in the 'physical' parameter space, 
such that 



tt{6i,92, ■ • ■ , 9d) d6i d92 ■ ■ ■ dOo ~ J duidu2 ■ ■ ■ duo- 

(10) 

In the simple case that the prior 7r(©) is separable 

7r(ei,6'2,--- ,6»d) =7ri(ei)7r2(e2)---7rD(6lD), (11) 

one can satisfy Eq. 10 by setting 

■Kj{9j)d9j — duj. (12) 

Therefore, for a given value of Uj , the corresponding value of 9j 
can be found by solving 



J —00 



(13) 



In the more general case in which the prior 7r(©) is not separable, 
one instead writes 

7T{ei,92, ■■■ ,9d)= MSl)MS2\9l) ■ ■ ■ 710(90191,92 ■ ■ ■ 9d-i) 

(14) 

where we define 

tt(9i,--- ,9,^i,9j,9j+i,--- ,9o)d9,+i---d9o.{l5) 

The physical parameters © corresponding to the parameters u in 
the unit hypercube can then be found by replacing the distributions 
■Kj in Eq. 13 with those defined in Eq. 15 and solving for 9j. The 
corresponding physical parameters are then used to calculate the 
likelihood value of the point u in the unit hypercube. 

It is worth mentioning that in many problems the prior 7r(©) 
is uniform, in which case the unit hypercube and the physical pa- 
rameter space coincide. Even when this is not so, one is often 
able to solve Eq. 13 analytically, resulting in virtually no compu- 
tational overhead. For more complicated problems, two alternative 
approaches are possible. First, one may solve Eq. 13 numerically, 
most often using look-up tables to reduce the computational cost. 
Alternatively, one can re-cast the inference problem, so that the 
conversion between the unit hypercube and the physical parame- 
ter space becomes trivial. This is straightforwardly achieved by, for 
example, defining the new 'likelihood' £'(©) = £(0)7r(0) and 
'prior' 7r'(0) = constant. The latter approach does, however, have 
the potential to be inefficient since it does not make use of the true 
prior 7r(0) to guide the sampling of the active points. 

5.2 Partitioning and construction of ellipsoidal bounds 

In FH08, the partitioning of the set of A'^ active points at each it- 
eration was performed in two stages. First, X-means (Pelleg et al. 
2000) was used to partition the set into the number of clusters that 
optimised the Bayesian Information Criterion (BIC). Second, to 
accommodate modes with elongated, curving degeneracies, each 
cluster identified by X-means was divided into sub-clusters to fol- 
low the degeneracy. To allow maximum flexibility, this was per- 
formed using a modified, iterative fc-means algorithm with k = 2 
to produce as many sub-clusters as possible consistent with there 
being at least D + 1 points in any sub-cluster, where D is the di- 
mensionality of the parameter space. As mentioned above, how- 
ever, this approach can lead to inefficiencies for simpler problems 
in which the iso-likelihood contour is well described by a few (well- 
separated) ellipsoidal bounds, owing to large overlaps between the 
ellipsoids enclosing each sub-cluster. Moreover, the factor / by 
which each ellipsoid was enlarged was chosen arbitrarily. 

We now address these problems by using a new method to par- 
tition the active points into clusters and simultaneously construct 
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the ellipsoidal bound for each cluster (this also makes redundant the 
notion of sub-clustering). At the i**^ iteration of the nested sampling 
process, an 'expectation-maximization' (EM) approach is used to 
find the optimal ellipsoidal decomposition of N active points dis- 
tributed uniformly in a region enclosing prior volume Xi, as set out 
below. 

Let us begin by denoting the set of A'^ active points in the 
unit hypercube by 5 = {ui, M2, • • • , mat} and some partitioning of 
the set into K clusters (called the set's /f-partition) by {Sk\k=u 
where K ^ 1 and 1)^=1 Sk = S. For a cluster (or subset) Sk 
containing nt points, a reasonably accurate and computationally 
efficient approximation to its minimum volume bounding ellipsoid 
is given by 



where 



i5fe = {«eM°|«^(/feCfe)-^«<l}, 



^ "k 

Ck = — y'K- - - t^kY 

rik ^ 



(16) 



(17) 



3=1 



is the empirical covariance matrix of the subset Sk and /x^ = 
X^j^i "3 is il^s center of the mass. The enlargement factor jk en- 
sures that Ek is a bounding ellipsoid for the subset Sk- The vol- 
ume of this ellipsoid, denoted by V(Ek^, is then proportional to 
Vdet(/fcCfc). 

Suppose for the moment that we know the volume Vi^S) of 
the region from which the set S is uniformly sampled and let us 
define the function 

1 

F{S)^^^Y.^{Ek). (18) 

The minimisation of F{S), subject to the constraint F{S) > 1, 
with respect to A'-partitionings {Sfcj^i will then yield an 'opti- 
mal' decomposition into K ellipsoids of the original sampled re- 
gion. The minimisation of F(S) is most easily performed using 
an 'expectation-minimization' scheme as set out below. This ap- 
proach makes use of the result (Lu et al. 2007) that for uniformly 
distributed points, the variation in F(S) resulting from reassigning 
a point with position u from the subset 5**; to the subset Sk' is given 
by 



■ V{Ek')d{u,Sy) V{Ek)d{u,Sky . ^^^^ 



V{Sk' 



V{Sk) 



where 7 is a constant. 



d{u,Sk) = ill - ^ik) (/fcCfc) ^(u-fik) 



(20) 



is the Mahalanobis distance from u to the centroid fik of ellipsoid 
Ek defined in Eq. 16, and 

nkV{S) 



V{Sk) 



N 



(21) 



may be considered as the true volume from which the subset of 
points Sk were drawn uniformly. The approach we have adopted in 
fact differs slightly from that outlined above, since we make further 
use of Eq. 21 to impose the constraint that the volume V{Ek) of 
the fc**^ ellipsoid should never be less than the 'true' volume V{Sk) 
occupied by the subset Sk ■ This can be easily achieved by enlarging 
the ellipsoid Ek by a factor fk, such that its volume V{Ek) = 
max:[V {Ek), V{Sk)], before evaluating Eqs. 18 and 19. 

In our case, however, at each iteration i of the nested sampling 
process, V{S) corresponds to the true remaining prior volume X,, 
which is not known. Nonetheless, as discussed in Section 3, we do 



know the expectation value of this random variable. We thus take 
V(S) = cxp{~i/N) which, in turn, allows us to define V{Sk) 
according to Eq. 21. 

From Eq. 19, we see that defining 



hkiu) = 



V{Ek)d{u,Sk) 
V{Sk) 



(22) 



for a point u £ S and assigning u £ Sk to Sk' only if hk{u) < 
hy (m), 'i k ^k' , is equivalent to minimizing F{S) using the vari- 
ational formula (Eq. 19). Thus, a weighted Mahalanobis metric can 
be used in the /c-means framework to optimize the functional F(S). 
In order to find out the optimal number of ellipsoids, K, a recur- 
sive scheme can be used which starts with K = 2, optimizes this 
2-partition using the metric in Eq. 22 and recursively partitions the 
resulting ellipsoids. For each iteration of this recursion, we employ 
this optimization scheme in Algorithm 1. 

Algorithm 1 Minimizing F{S), subject to F{S) ^ 1, for points 
S = {ui,U2, - ■ ■ ,un} uniformly distributed in a volume V{S). 

1: calculate bounding ellipsoid E and its volume V{E) 

2: enlarge E so that V{E) = max[V(£), V{S)]. 

3: partition S into Si and S2 containing ni and points respec- 
tively by applying fc— means clustering algorithm with K = 2. 

4: calculate Ei, E2 and their volumes V{Ei) and V{E2) respec- 
tively. 

5: enlarge Ek (k = 1, 2) sothaty(Sfc) = raax[V{Ek), V{Sk)]. 

6: for all M € 5* do 

7: assign M to Sfc such that /ifc(M) = mm[hi{x),h2{x)]. 

8: end for 

9: if no point has been reassigned then 

10: go to Step 14. 

11: else 

12: go to step 4. 

13: end if 

14: i(V{Ei) + V(E2) < ViE) or V{E) > 2V{S) then 

15: partition S into Si and S2 and repeat entire algorithm for 

each subset Si and S2. 

16: else 

17: return E as the optimal elUpsoid of the point set S. 

18: end if 



In step 14 of Algorithm 1 we partition the point set S with 
ellipsoidal volume V{E) into subsets Si and S2 with ellipsoidal 
volumes V{Ei) and V{E2) even if ^(^1) + V{E2) > V{E), 
provided V{E) > 2V(S). This is required since, as discussed in 
Lu et al. (2007), the minimizer of F(S) can be over-conservative 
and the partition should still be performed if the ellipsoidal volume 
is greater than the true volume by some factor (we use 2). 

The above EM algorithm can be quite computationally ex- 
pensive, especially in higher dimensions, due to the number of 
eigenvalue and eigenvector evaluations required to calculate el- 
lipsoidal volumes. Fortunately, MultiNest does not need to 
perform the full partitioning algorithm at each iteration of the 
nested sampling process. Once partitioning of the active points 
and construction of the ellipsoidal bounds has been performed 
using Algorithm 1, the resulting ellipsoids can then be evolved 
through scaling at subsequent iterations so that their volumes are 
max[V {Ek),Xi+ink /N], where with Xi+i is the remaining prior 
volume in the next nested sampling iteration and Uk is number of 
points in the subset Sk at the end of iteration. As the MULTI- 
NEST algorithm moves to higher likelihood regions, one would 
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(b) 

Figure 3. Illustrations of the ellipsoidal decompositions returned by Algo- 
lithm 1: the points given as input are overlaid on the resulting ellipsoids. 
1000 points were sampled uniformly from: (a) two non-intersecting ellip- 
soids; and (b) a torus. 



expect the ellipsoidal decomposition calculated at some earlier it- 
eration to become less optimal. We therefore perform a full re- 
partitioning of the active points using Algorithm 1 if ^(5*) ^ h; 
we typically use h — 1.1. 

The approach outlined above allows maximum flexibility and 
sampling efficiency by breaking up a posterior mode resembling 
a Gaussian into relatively few ellipsoids, but a mode possesses a 
pronounced curving degeneracy into a relatively large number of 
small 'overlapping' ellipsoids. In Fig. 3 we show the results of ap- 
plying Algorithm 1 to two different problems in three dimensions: 
in (a) the iso-likelihood surface consists of two non-overlapping 
ellipsoids, one of which contains correlations between the param- 
eters; and in (b) the iso-likelihood surface is a torus. In each case, 
1000 points were uniformly generated inside the iso-likelihood sur- 
face are used as the starting set S in Algorithm 1. In case (a). Al- 
gorithm 1 correctly partitions the point set in two non-overlapping 



ellipsoids with F(S) — 1.1, while in case (b) the point set is parti- 
tioned into 23 overlapping ellipsoids with F{S) = 1.2. 

In our nested sampling application, it is possible that the el- 
lipsoids found by Algorithm 1 might not enclose the entire iso- 
likelihood contour, even though the sum of their volumes is con- 
strained to exceed the prior volume X This is because the ellip- 
soidal approximation to a region in the prior space might not be 
perfect. It might therefore be desirable to sample from a region with 
volume greater than the prior volume. This can easily be achieved 
by using X/e as the desired minimum volume in Algorithm 1, 
where X is the prior volume and e the desired sampling efficiency 
(1/e is the enlargement factor). We also note that if the desire sam- 
pling efficiency e is set to be greater than unity, then the prior can 
be under-sampled. Indeed, setting e > 1 can be useful if one is not 
interested in the evidence values, but wants only to have a general 
idea of the posterior structure in relatively few likelihood evalua- 
tions. We note that, regardless of the value of e, it is always en- 
sured that the ellipsoids enclosing the subsets Sk are always 
the bounding ellipsoids. 

5.3 Sampling from overlapping ellipsoids 

Once the ellipsoidal bounds have been constructed at some iteration 
of the nested sampling process, one must then draw a new point 
uniformly from the union of these ellipsoids, many of which may be 
overlapping. This is achieved using the method presented in FH08, 
which is summarised below for completeness. 

Suppose at iteration i of the nested sampling algorithm, one 
has K ellipsoids {iSfc}. One ellipsoid is then chosen with probabil- 
ity pk equal to its volume fraction 

Pfc = V(£fe)/Kot, (23) 

where Kot = X^s^i ''^(^'fc)- Samples are then drawn uniformly 
from the chosen ellipsoid until a sample is found for which the hard 
constraint £ > £i is satisfied, where d is the lowest-likelihood 
value among all the active points at that iteration. There is, of 
course, a possibility that the chosen ellipsoid overlaps with one or 
more other ellipsoids. In order to take an account of this possibil- 
ity, we find the number of ellipsoids, rie, in which the sample lies 
and only accept the sample with probability l/rie . This provides a 
consistent sampling procedure in all cases. 

5.4 Decreasing the number of active points 

For highly multimodal problems, the nested sampling algorithm 
would require a large number A'^ of active points to ensure that 
all the modes are detected. This would consequently result in very 
slow convergence of the algorithm. In such cases, it would be de- 
sirable to decrease the number of active points as the algorithm 
proceeds to higher likelihood levels, since the number of isolated 
regions in the iso-likelihood surface is expected to decrease with in- 
creasing likelihood, modes Fortunately, nested sampling does not 
require the number of active points to remain constant, provided 
the fraction by which the prior volume is decreased after each it- 
eration is adjusted accordingly. Without knowing anything about 
the posterior, we can use the largest evidence contribution that can 
be made by the remaining portion of the posterior at the i*'^ itera- 
tion AZi = CmaxXi, as the guide in reducing the number of active 
points by assuming that the change in AZ is linear locally. We thus 
set the number of active points Ni at the i^^ iteration to be 

AZi^-i — AZi 
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subject to the constraint iVmin < iVi < iVi_i, where N^i^ is the 
minimum number of active points allowed and tol is the tolerance 
on the final evidence used in the stopping criterion. 

5.5 Parallelization 

Even with the enlargement factor e set to unity (see Section 5.2), 
the typical sampling efficiency obtained for most problems in astro- 
physics and particle physics is around 20-30 per cent for two main 
reasons. First, the ellipsoidal approximation to the iso-likelihood 
surface at any iteration is not perfect and there may be regions of the 
parameter space lying inside the union of the ellipsoids but outside 
the true iso-likelihood surface; samples falling in such regions will 
be rejected, resulting in a sampling efficiency less than unity. Sec- 
ond, if the number of ellipsoids at any given iteration is greater than 
one, then they may overlap, resulting in some samples with C> Ci 
falling inside a region shared by rie ellipsoids; such points are ac- 
cepted only with probability l/rie, which consequently lowers the 
sampling efficiency. Since the sampling efficiency is typically less 
than unity, the MultiNest algorithm can be usefully (and eas- 
ily) parallelized by, at each nested sampling iteration, drawing a 
potential replacement point on each of A^cpu processors, where 
1/iVcpu is an estimate of the sampling efficiency. 

5.6 Identification of modes 

As discussed in FH08, for multimodal posteriors it can prove use- 
ful to identify which samples 'belong' to which mode. There is 
inevitably some arbitrariness in this process, since modes of the 
posterior necessarily sit on top of some general 'background' in 
the probability distribution. Moreover, modes lying close to one 
another in the parameter space may only 'separate out' at rela- 
tively high likelihood levels. Nonetheless, for well-defined, 'iso- 
lated' modes, a reasonable estimate of the posterior mass that each 
contains (and hence the associated 'local' evidence) can be defined, 
together with the posterior parameter constraints associated with 
each mode. To perform such a calculation, once the nested sam- 
pling algorithm has progressed to a likelihood level such that (at 
least locally) the 'footprint' of the mode is well-defined, one needs 
to identify at each subsequent iteration those points in the active 
set belonging to that mode. The partitioning and ellipsoids con- 
struction algorithm described in Section 5.2 provides a much more 
efficient and reliable method for performing this identification, as 
compared with the methods employed in FH08. 

At the beginning of the nested sampling process, all the active 
points are assigned to one 'group' Gi. As outlined above, at sub- 
sequent iterations, the set of N active points is partitioned into K 
subsets {Su} and their corresponding ellipsoids {Ek} constructed. 
To perform mode identification, at each iteration, one of the subsets 
Sk is then picked at random: its members become the first members 
of the 'temporary set' T and its associated ellipsoid becomes 
the first member of the set of ellipsoids £. All the other elUpsoids 
Ek' (k' =^ k) are then checked for intersection with E^ using an 
exact algorithm proposed by Alfano & Greer (2003). Any ellipsoid 
found to intersect with Ek is itself added to £ and the members of 
the corresponding subset Sk are added to T. The set 8 (and con- 
sequently the set T) is then iteratively built up by adding to it any 
ellipsoid not in £ that intersects with any of those already in £, un- 
til no more ellipsoids can be added. Once this is completed, if no 
more ellipsoids remain then all the points in T are (rc)assigned to 
Gi, a new active point is drawn from the union of the ellipsoids 



{Ek} (and also assigned to Gi) and the nested samphng process 
proceeds to its next iteration. 

If, however, there remain ellipsoids {E^} not belonging to £, 
then this indicates the presence of (at least) two isolated regions 
contained within the iso-likelihood surface. In this event, the points 
in T are (re)assigned to the group G2 and the remaining active 
points are (re)assigned to the group G3. The original group Gi 
then contains only the inactive points generated up to this nested 
sampling iteration and is not modified further. The group G3 is 
then analysed in a similar manner to see if can be split further (into 
G3 and Gi), and the process continued until no further splitting is 
possible. Thus, in this case, one is left with an 'inactive' group Gi 

and a collection of 'active' groups G2, G3, A new active point 

is then drawn from the union of the ellipsoids {Ek}, and assigned 
to the appropriate active group, and the nested sampling process 
proceeds to its next iteration. 

At subsequent nested sampling iterations, each of the active 
groups present at that iteration is analysed in the same manner to 
see if it can be split. If so, then the active points in that group are 
(re)assigned to new active groups generated as above, and the orig- 
inal group becomes inactive, retaining only those of its points that 
are inactive. In order to minimize the computational cost, we take 
advantage of the fact that the ellipsoids created by Algorithm 1 can 
only get smaller in later iterations. Hence, within each active group, 
if two ellipsoids arc found not to intersect at some iteration, they are 
not checked for intersection in later iterations. This makes the com- 
putational expense involved in separating out the modes negligible. 

At the end of the nested sampling process, one thus obtains 
a set of inactive groups and a set of active groups, which between 
them partition the full set of (inactive and active) sample points 
generated. It is worth noting that, as the nested sampling process 
reaches higher likelihood levels, the number of active points in any 
particular active group may dwindle to zero, but such a group is still 
considered active since it remains unsplit at the end of the nested 
sampling run. Each active groups is then promoted to a 'mode', 
resulting in a set of L (say) such modes {Mi}. 

As a concrete example, consider the two-dimensional illus- 
tration shown in Fig. 4, in which the solid circles denote active 
points at the nested sampling iteration i — 12, and the open cir- 
cles are the inactive points at this stage. In this illustration, the first 
group Gi remains unsplit until iteration i = ii of the nested sam- 
pling process, at which stage it is split into G2, G3 and G4. The 
group G3 then remains unsplit until iteration i = i2, when it is 
split into G5, Ge and G7. The group G4 remains unsplit until it- 
eration i = 12, when it is spUt into Gs and Gg. The group G2 
remains unspUt at iteration 1 = 12 but the number of active points it 
contains has fallen to zero, since it is a low-lying region of the like- 
lihood function. Thus, at the iteration i = 12, the inactive groups 
are Gi, G3 and G4, and the active groups are G2, G5, GqjGt, Gs 
and Ge. If (say) all of the latter collection of groups were to remain 
active until the end of the nested sampling process, each would 
then be promoted to a mode according to Mi = G2, M2 = G5, 
M3 = Gs,-- - ,M6 = Gg. 



5.7 Evaluating 'local' evidences 

The reliable identification of isolated modes {Mi} allows one to 
evaluate the local evidence associated with each mode much more 
efficiently and robustly than the methods presented in FH08. Sup- 
pose the l^^ mode Mi contains the points {uj} (j = 1, • • • , m). In 
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Figure 4. Cartoon illustrating the assignment of points to groups; see text 
for details. The iso-likelihood contours C = Ci-^ and C = Ci^ are shown 
as the dashed lines and dotted lines respectively. The solid circles denote 
active points at the nested sampling iteration i = 12, and the open circles 
are the inactive points at this stage. 



the simplest approach, the local evidence of this mode is given by 



(25) 



where (as in Eq. 8) Wj = Xm/N for each active point in Mi, 
and for each inactive points Wj = ^{Xi-i - Xi+i), in which 
i is the nested sampling iteration at which the inactive point was 
discarded. In a similar manner, the posterior inferences resulting 
from the mode are obtained by weighting each point in Mi by 
Pj = ^jWj/Zi. 

As outlined in FH08, however, there remain some problems 
with this approach for modes that are sufficiently close to one an- 
other in the parameter space that they are only identified as iso- 
lated regions once the algorithm has proceeded to likelihood values 
somewhat larger than the value at which the modes actually sepa- 
rate. The 'local' evidence of each mode will then be underestimated 
by Eq. 25. In such cases, this problem can be overcome by also 
making use of the points contained in the inactive groups at the end 
of the nested sampUng process, as follows. 

For each mode Mi, expression Eq. 25 for the local evidence is 
replaced by 



(I) 



(26) 



where the additional summation over g includes all the points in the 
inactive groups, the weight Wg = ^ {Xi-i — Xj+i), where i is the 
nested sampling iteration in which the g**^ point was discarded, and 
the additional factors ' are calculated as set out below. Similarly, 
posterior inferences from the l^^ mode are obtained by weighting 
each point in Mi by pj = CjWj/Zi and each point in the inactive 

groups by Pg = LgWgO-g^ jZi. 

The factors Og ' can be determined in a number of ways. The 
most straightforward approach is essentially to reverse the process 
illustrated in Fig. 4, as follows. Each mode Mi is simply an active 
group G that has been renamed. Moreover, one can identify the in- 
active group G' that split to form G at the nested sampling iteration 



I. All points in the inactive group G' are then assigned the factor 



(27) 



where n^Q^ {1) is the number of active points in G at nested sam- 
pling iteration i, and similarly for (i). Now, the group G' may 
itself have been formed when an inactive group G" split at an 
earlier nested sampling iteration i' < i, in which case all the points 
in G" are assigned the factor 



(28) 



The process is continued until the recursion terminates. Finally, all 
points in inactive groups not already assigned have ' — 0. 

As a concrete example, consider M2 = G5 in Fig. 4. In this 
case, the factors assigned to the members of all the inactive groups 
Gi, G3 and G4 are 



(2) 



"03' (*2) 



10 



for g € G3 

for geGi 
for 5 € G4 



(29) 



It is easy to check that the general prescription (Eqs. 27 and 
28) ensures that 



(30) 



i.e. the sum of the local evidences for each mode is equal to the 
global evidence. An alternative method for setting the factors Og 
for which Eq. 30 again holds, is to use a mixture model to analyse 
the full set of points (active and inactive) produced, as outlined in 
Appendix A. 



6 APPLICATIONS 

In this section we apply the MultiNest algorithm described 
above to two toy problems to demonstrate that it indeed calculates 
the Bayesian evidence and makes posterior inferences accurately 
and efficiently. These toy examples are chosen to have features that 
resemble those that can occur in real inference problems in astro- 
and particle physics. 



6.1 Toy model 1: egg-box likelihood 

We first demonstrate the appUcation of MultiNest to a highly 
multimodal two-dimensional problem, for which the likeUhood re- 
sembles an egg-box. The un-normaUzed Ukelihood is defined as 



£(6)i,6>2) =exp 



2 + cos 



(31) 



and we assume a uniform prior W(0, IOtt) for both 61 and 62- A plot 
of the log-likelihood is shown in Fig. 5 and the the prior ranges are 
chosen such that some of the modes are truncated. Hence, although 
only two-dimensional, this toy example is a particularly challeng- 
ing problem, not only for mode identification but also for evaluating 
the local evidence of each mode accurately. Indeed, even obtaining 
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posterior samples efficiently from such a distribution can present a 
challenge for standard Metropolis-Hastings MCMC samplers. We 
note that distributions of this sort can occur in astronomical object 
detection applications (see FH08). 

Owing to the highly multimodal nature of this problem, we 
use 2000 active points. The results obtained with MultiNest are 
illustrated in Fig. 5, in which the dots show the points with the 
lowest likelihood at successive iterations of the nested sampling 
process, and different colours indicate points assigned to differ- 
ent isolated modes as the algorithm progresses. MultiNest re- 
quired ~ 30, 000 likelihood evaluations and evaluated the global 
log-evidence value to be 235.86±0.06, which compares favourably 
with the log-evidence value of 235.88 obtained through numerical 
integration on a fine grid. The local log-evidence values of each 
mode, calculated through numerical integration on a fine grid (de- 
noted as 'truelog(.Z)') and using MultiNest are listed in Table 1. 
We see that there is good agreement between the two estimates. 



6.2 Toy model 2: Gaussian shells likelihood 

We now illustrate the capabilities of our MultiNest in sampling 
from a posterior containing multiple modes with pronounced (curv- 
ing) degeneracies, and extend our analysis to parameters spaces of 
high dimension. 

Our toy problem here is the same one used in FH08 and Al- 
lanach & Lester (2007). The likelihood function in this model is 
defined as. 



C{6) = circ(0; ci,ri,wi) + circ(0; 02,7-2,^2) 



where 



circ(0; c, r, w) 



V2^ 



: exp 



{\e-c\ -r) 



2it;2 



(32) 



(33) 



In two dimensions, this toy distribution represents two well sepa- 
rated rings, centred on the points ci and C2 respectively, each of 
radius r and with a Gaussian radial profile of width w (see Fig. 6). 
With a sufficiently small w value, this distribution is representative 
of the likelihood functions one might encounter in analysing forth- 
coming particle physics experiments in the context of beyond-the- 
Standard-Model paradigms; in such models the bulk of the proba- 



Mode 


true local log (2) 


MultiNest local log(.Z) 


1 


233 


33 


233.20 ±0.08 


2 


233 


33 


233.10 ±0.06 


3 


233 


33 


233.48 ± 0.05 


4 


233 


33 


233.43 ±0.05 


5 


233 


33 


233.65 ±0.05 


6 


233 


33 


233.27 ±0.05 


7 


233 


33 


233.14 ±0.06 


8 


233 


33 


233.81 ± 0.04 


9 


232 


64 


232.65 ±0.12 


10 


232 


64 


232.43 ±0.16 


11 


232 


64 


232.11 ±0.14 


12 


232 


64 


232.44 ±0.11 


13 


232 


64 


232.68 ±0.11 


14 


232 


64 


232.84 ± 0.09 


15 


232 


64 


233.02 ± 0.09 


16 


232 


64 


231.65 ±0.29 


17 


231 


94 


231.49 ±0.27 


18 


231 


94 


230.46 ± 0.36 



Table 1. The local log-evidence values of each mode for the toy model 1 , 
described in Section 6.1, calculated through numerical integration on a fine 
grid (the 'true log(2)') and using the MULTINEST algorithm. 



bility lies within thin sheets or hypersurfaces through the full pa- 
rameter space. 

We investigate the above distribution up to a 30-dimensional 
parameter space with MultiNest. In all cases, the centres of 
the two rings are separated by 7 units in the parameter space, and 
we take wi — W2 = 0.1 and ri = r2 — 2. We make ri and 
r2 equal, since in higher dimensions any slight difference between 
these two values would result in a vast difference between the vol- 
umes occupied by the rings and consequently the ring with the 
smaller r value would occupy a vanishingly small fraction of the 
total probability volume, making its detection almost impossible. 
It should also be noted that setting w = 0.1 means the rings have 
an extremely narrow Gaussian profile and hence they represent an 
'optimally difficult' problem for our ellipsoidal nested sampling al- 
gorithm, since many tiny ellipsoids are required to obtain a suffi- 
ciently accurate representation of the iso-likelihood surfaces. For 
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Peak 1 
Peak 2 

Likelihood 




Figure 6. Toy model 2: (a) two-dimensional plot of the likelihood function defined in Eqs. (32) and (33); (b) dots denoting the points with the lowest likelihood 
at successive iterations of the MultiNest algorithm. Different colours denote points assigned to different isolated modes as the algoiithm progresses. 



Analytical MultiNest 



D 




local log(2) 


log(2) 


local log(2i) 


local log(22) 


2 


-1.75 


-2.44 


-1.72 ±0.05 


-2.28 ± 0.08 


-2.56 ±0.08 


5 


-5.67 


-6.36 


-5.75 ±0.08 


-6.34 ±0.10 


-6.57 ±0.11 


10 


-14.59 


-15.28 


-14.69 ±0.12 


-15.41 ± 0.15 


-15.36 ±0.15 


20 


-36.09 


-36.78 


-35.93 ±0.19 


-37.13 ±0.23 


-36.28 ±0.22 


30 


-60.13 


-60.82 


-59.94 ±0.24 


-60.70 ± 0.30 


-60.57 ±0.32 



Table 2. The true and estimated global and local \og{Z) for toy model 2, as a function of the dimensions D of the parameter space, using MULTINEST. 



the two-dimensional case, with the parameters described above, the 
likelihood is shown in Fig. 6. 

In analysing this problem using the methods presented in 
FH08, we showed that the sampling efficiency dropped signifi- 
cantly with increasing dimensionality, with the efficiency being less 
than 2 per cent in 10 dimensions, with almost 600, 000 likelihood 
evaluations required to estimate the evidence to the required accu- 
racy. Using 1000 active points in MultiNest, we list the evaluated 
and analytical evidence values in Table 2. The total number of like- 
lihood evaluations and the sampling efficiencies are listed in Table 
3. For comparison, we also list the number of likelihood evaluations 
and the sampling efficiencies with the ellipsoidal nested sampling 
method proposed in FH08. One sees that MultiNest requires an 
order of magnitude fewer likelihood evaluations than the method 
of FH08. In fact, the relative computational cost of MultiNest is 
even less than this comparison suggests, since it no longer performs 
an eigen-analysis at each iteration, as discussed in Section 5.2. In- 
deed, for this toy problem discussed, the EM partitioning algorithm 
discussed in Section 5.2 was on average called only once per 1000 
iterations of the MultiNest algorithm. 



7 COSMOLOGICAL PARAMETER ESTIMATION AND 
MODEL SELECTION 

Likelihood functions resembling those used in our toy models do 
occur in real inference problems in astro- and particle physics, such 
as object detection in astronomy (see e.g. Hobson & McLachlan 
2003; FH08) and analysis of beyond-the-Standard-Model theories 
in particle physics phenomenology (see e.g. Feroz et al. 2008). 



from FH08 MULTINEST 
D ^likc Efficiency Mike Efficiency 



2 


27, 658 


15.98% 


7,370 


70.77% 


5 


69, 094 


9.57% 


17,967 


51.02% 


10 


579, 208 


1.82% 


52,901 


34.28% 


20 


43,093,230 


0.05% 


255,092 


15.49% 


30 






753, 789 


8.39% 



Table 3. The number of likelihood evaluations and sampling efficiency for 
the ellipsoidal nested sampHng algoiithm of FH08 and MULTINEST, when 
applied to toy model 2 as a function of the dimension D of the parameter 
space. 



Nonetheless, not all likelihood functions are as challenging and it 
is important to demonstrate that MultiNest is more efficient (and 
certainly no less so) than standard Metropolis-Hastings MCMC 
sampling even in more straightforward inference problems. 

An important area of inference in astrophysics is that of cos- 
mological parameter estimation and model selection, for which the 
likelihood functions are usually quite benign, often resembling a 
single, broad multivariate Gaussian in the allowed parameter space. 
Therefore, in this section, we apply the MultiNest algorithm to 
analyse two related extensions of the standard cosmology model: 
non-flat spatial curvature and a varying equation of state of dark 
energy. 

The complete set of cosmological parameters and the ranges 
of the uniform priors assumed for them are given in Table 4, where 
the parameters have their usual meanings. With Slk = and 
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0.018 




0.032 


0.04 




0.16 


0.98 


^ sS 


1.1 


0.01 


^ r ^ 


0.5 


-0.1 




0.1 


-1.5 


^ li) ^ 


-0.5 


0.8 




1.2 


2.6 


^ log[10lOAs] ^ 


4.2 



Table 4. Cosmological parameters and unifomi priors ranges for the vanilla 
ACDM model, plus spatial curvature fii; and dark energy equation of state 
parameter w. 



w = —\ this model then represents the 'vanilla' ACDM cosmol- 
ogy. In addition, mirroring the recent analysis of the WMAP 5-year 
(WMAP5) data (Dunkley et al. 2008), a Sunyaev-Zel'dovich ampli- 
tude is introduced, with a uniform prior in the range [0, 2]. We have 
chosen three basic sets of data: CMB observations alone; CMB plus 
the Hubble Space Telescope (HST) constraint on Hq (Freedman 
et al. 2001); and CMB plus large scale structure (LSS) constraints 
on the matter power spectrum derived from the luminous red galaxy 
(LRG) subset of the Sloan Digital Sky Survey (SDSS; Tegmark 
et al. 2006; Tegmark et al. 2004) and the two degree field survey 
(2dF; Cole et al. 2005). In addition, for the dark energy analysis we 
include distance measures from supernovae la data (Kowalski et al. 
2008). The CMB data comprises WMAP5 observations (Hinshaw 
et al. 2008) -I- higher resolution datasets from the Arcminute Cos- 
mology Bolometer Array (ACBAR; Reichardt et al. 2008) -I- the 
Cosmic Background Imager (CBI; Readhead et al. 2004; Sievers 
J. L. et al. 2007; CBI Supplementary Data 2006) -I- Balloon Ob- 
servations of Millimetric Extragalactic Radiation and Geophysics 
(BOOMERANG; Piacentini et al. 2006; Jones et al. 2006; Mon- 
troy et al. 2006). 

Observations of the first CMB acoustic peak cannot in them- 
selves constrain the spatial curvature Slk. This could be constrained 
using angular scale of the first acoustic peak coupled with a knowl- 
edge of the distance to the last scattering surface, but the latter is a 
function of the entire expansion history of the universe and so there 
is a significant degeneracy between and the Hubble parameter 
H{z). This dependence, often termed the 'geometric degeneracy', 
can be broken, however, since measurements at different redshifts 
can constrain a sufficiently different functions of fik and H. Thus, 
the combination of CMB data with measurements of the acous- 
tic peak structure in the matter power spectrum derived from large 
scale structure surveys such as the LRG subset of Sloan can place 
much tighter constraints on curvature than with either alone (see 
e.g. Eisenstein et al. 2005; Tegmark et al. 2006; Komatsu et al. 
2008). 

Inflation generically predicts a flat universe (Guth 1981). The 
tightest current constraint suggests fik ~ 10^^, whereas inflation 
lasting over 60 e-folds would produce flatness at the level of 10^^ 
(Komatsu et al. 2008). Thus, at present, the data is not capable of re- 
futing or confirming such an inflationary picture. From a Bayesian 
point of view, however, one can still assess whether the data cur- 
rently prefers the inclusion of such a physical parameter. 

The algorithmic parameters of MultiNest were appropri- 
ately chosen given our a priori knowledge of the uni-modal form of 
typical cosmological posteriors, the dimensionality of the problem 
and some empirical testing. The number of active points was set 
to TV = 400 and a sampling efficiency e of 0.3 means that MPI 
parallelisation across 4 CPUs is optimal (with a further 4 openmp 



70 
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Figure 7. Breaking of the 'geometric' degeneracy in CMB data (blue) via 
the addition of HST (green) and large scale structure data (red). 



Dataset \ Model 


vanilla + f2k vanilla + f^k + w 


CMB alone 


-0.29 ±0.27 


CMB + HST 


-1.56 ±0.27 


ALL 


-2.92 ±0.27 -1.29 ±0.27 



Table 5. Differences of log evidences for both models and the three datasets 
described in the text. [Negative (Positive) values represent lower (higher) 
preference for the parameterisation change] 



threads per MPI CPU used by CAMB's multithreading facility). 
This represents a relatively modest computational investment. All 
of the inferences obtained in this section required between 40,000 
and 50,000 likelihood evaluations. 

7.1 Results: spatial curvature 

Fig. 7. illustrates the progressively tighter constraints placed on fik 
and Ho produced by combining CMB data with other cosmological 
probes of large scale structure. The geometric degeneracy is clearly 
unbroken with CMB data alone, but the independent constraints on 
Ho by HST are seen to tighten the constraint somewhat. Including 
LSS data, specifically the LRG data, markedly reduces the uncer- 
tainty on the curvature so that at 1-cr we can limit the curvature 
range to —0.043 ^ f2k ^ 0.004. The asymmetry of this constraint 
leaves a larger negative tail of f2k resulting in a mean value that is 
only slightly closed. However, even these most stringent parameter 
constraints available, we see no statistically significant deviation 
from a spatially flat universe. The Bayesian evidence, in penalis- 
ing excessive model complexity should tell us whether relaxing the 
constraint on flatness is preferred by the data. Our results (Table 5) 
very clearly rule out the necessity for such an addition in anything 
other than with CMB data alone. This implies that the inclusion of 
spatial curvature is an unnecessary complication in cosmological 
model building, given the currently available data. 

7.2 Results: varying equation of state of dark energy 

The properties of the largest universal density component are still 
largely unknown, yet such a component seems crucial for the uni- 
verse to be spatially flat. It has thus been argued by some (Wright 
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Figure 8. Joint constraints on universal geometry Slj^ and the equation of 
state of dark energy w using WMAP5 + HST + LSS + supernovae data. 



2006 & Tegmark et al. 2006) that it is inappropriate to assume spa- 
tial flatness when attempting to vary the properties of the dark en- 
ergy component beyond those of a simple cosmological constant. 
Here we allow for the variation of the dark energy equation of state 
parameter w. We will therefore proceed by placing joint constraints 
on both w and f^k, as performed in Komatsu et al. (2008). Once 
again we encounter a serious degeneracy, this time between Qy^ and 
w. With the geometric degeneracy, combining cosmological obser- 
vations of the universe at different redshifts was sufficient to break 
the dependence, but when dark energy is dynamical we must use at 
least a third, independent data set. In this analysis, we have there- 
fore included distance measures from type la supernovae observa- 
tions (Kowalski et al. 2008). Using this combination of data pro- 
duces impressively tight constraints on both w and Jlk ; indeed the 
resulting constraints on the spatial curvature are tighter than those 
obtained in the previous section, for which w was held constant at 
w = —1. This is primarily due to the near orthogonality of the con- 
straints provided by supernovae and the CMB. Once again we find 
little evidence to support a departure from the basic vanilla cosmol- 
ogy (see Table 5). To within estimated uncertainty, the Bayesian 
evidence is at least one log unit greater for a flat universe with dark 
energy in the form of a cosmological constant. 



7.3 Comparison of MultiNest and MCMC 'quick look' 
parameter constraints 

The above results are in good agreement with those found by Ko- 
matsu et al. (2008) using more traditional MCMC methods, and 
indicate that MultiNest has produced reliable inferences, both 
in terms of the estimated evidence values and the derived posterior 
parameter constraints. It is often the case, however, that cosmolo- 
gists wish only to obtain a broad idea of the posterior distribution 
of parameters, using short MCMC chains and hence relatively few 
likelihood evaluations. In this section, we show that the MULTI- 
NEST algorithm can also perform this task by setting the number 
of active points, A'^, to a smaller value. 

In order to illustrate this functionality, we analyse the 
WMAP5 CMB data-set in the context of the vanilla ACDM cos- 
mology using both MultiNest and the publicly available Cos- 
moMC Lewis & Bridle (2002) package, which uses an MCMC 
sampler based on a tailored version of the Metropolis-Hastings 
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Figure 9. 1 -D marginalized posteriors for the flat- ACDM cosmology ob- 
tained with: CosmoMC using 48,000 likelihood evaluations (solid black); 
CosmoMC using 3,200 likelihood evaluations (dotted pink); and MULTI- 
NEST using 3,100 likelihood evaluations (dashed blue). 



method. We imposed uniform priors on all the parameters. The 
prior ranges for Qhh^ , ficdm/i^, 6, and r is listed in Table 4. In 
addition, Us was allowed to vary between 0.8 and 1.2, log(10" A^) 
between 2.6 and 4.2 and the Sunyaev-Zel'dovich amplitude be- 
tween and 2. 

To facilitate later comparisons, we first obtained an accu- 
rate determination of the posterior distribution using the traditional 
method by running CosmoMC to produce 4 long MCMC chains. 
The Gelman-Rubin statistic R returned by CosinoMC indicated 
that the chains had converged to within i? ~ 1.1 after about 6000 
steps per chain, resulting in ~ 24,000 likelihood evaluations. To 
be certain of determining the 'true' posterior to high accuracy, we 
then ran the MCMC chains for a further 6,000 samples per chain, 
resulting in a total of 48,000 likelihood evaluations, at which point 
the convergence statistic was R « 1.05. 

As stated above, however, one often wishes to obtain only a 
'quick-and-dirty' estimate of the posterior in the first stages of the 
data analysis. As might be typical of such analyses, we ran Cos- 
moMC again using 4 MCMC chains, but with only 800 steps per 
chain, resulting in a total of 3,200 likelihood evaluations. As a com- 
parison, we also ran the MultiNest algorithm using only 50 ac- 
tive points and with the sampling efficiency e set to unity; this re- 
quired a total of 3,100 likelihood evaluations. In Fig. 9, we plot the 
I-D marginalized posteriors for derived from the two analyses, to- 
gether with the results of the longer CosmoMC analysis described 
above. It is clear from these plots that both the MultiNest and 
MCMC 'quick-look' results compare well with the 'true' posterior 
obtained from the more costly rigorous analysis. 
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8 DISCUSSION AND CONCLUSIONS 

We have described a highly efficient Bayesian inference tool, called 
MultiNest, which we have now made freely available for aca- 
demic purposes as a plugin that is easily incorporated into the COS- 
MOMC software. On challenging toy models that resemble real in- 
ference problems in astro- and particle physics, we have demon- 
strated that MultiNest produces reliable estimates of the evi- 
dence, and its uncertainty, and accurate posterior inferences from 
distributions with multiple modes and pronounced curving degen- 
eracies in high dimensions. We have also demonstrated in cos- 
mological inference problems that MultiNest produces accurate 
parameter constraints on similar time scales to standard MCMC 
methods and, with negligible extra computational effort, also yields 
very accurate Bayesian evidences for model selection. As a cos- 
mological appUcation we have considered two extensions of the 
basic vanilla ACDM cosmology: non-zero spatial curvature and a 
varying equation of state of dark energy. Both extensions are deter- 
mined to be unnecessary for the modelling of existing data via the 
evidence criterion, confirming that with the advent of five years of 
WMAP observations the data is still satisfied by a ACDM cosmol- 
ogy- 

As a guide for potential users, we conclude by noting that the 
MultiNest algorithm is controlled by two main parameters: (i) 
the number of active points N; and (ii) the maximum efficiency e 
(see Section 5.2). These values can be chosen quite easily as out- 
lined below. First, A'^ should be large enough that, in the initial sam- 
pling from the full prior space, there is a high probability that at 
least one point lies in the 'basin of attraction' of each mode of the 
posterior. In later iterations, active points will then tend to populate 
these modes. It should be remembered, of course, that A'^ must al- 
ways exceed the dimensionality D of the parameter space. Also, in 
order to calculate the evidence accurately, A^ should be sufficiently 
large so that all the regions of the parameter space arc sampled ade- 
quately. For parameter estimation only, one can use far fewer active 
points. For cosmological data analysis, we found 400 and 50 active 
points to be adequate for evidence evaluation and parameter esti- 
mation respectively. The parameter e controls the sampling volume 
at each iteration, which is equal to the sum of the volumes of the 
ellipsoids enclosing the active point set. For parameter estimation 
problems, e should be set to 1 to obtain maximum efficiency with- 
out undersampling or to a lower value if one wants to get a general 
idea of the posterior very quickly. For evidence evaluation in cos- 
mology, we found setting e ~ 0.3 ensures an accurate evidence 
value. 
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APPENDIX A: LOCAL EVIDENCE EVALUATION USING 
A GAUSSIAN MIXTURE MODEL 

As mentioned in Section 5.7, an alternative method for determining 
the local evidence and posterior constraints associated with each 
identified mode Mi is to analyse the full set of samples using a 
mixture model. In what follows, we assume that the modes of the 
'impUed' posterior V{u)ivi the unit hypercube space are each well 
described by a multivariate Gaussian, leading to a Gaussian mixture 
model, but the model for the mode shape can be easily changed to 
another distribution. 

Let us assume that, at the end of the nested sampUng pro- 
cess, the full set of M (inactive and active) points obtained is 
{mi,M2,--- ,uj^} and L modes Mi,M2,--- ,Ml modes have 
been identified. The basic goal of the method is to assign an extra 
factor a^P to every point (j = Ito J\f), so that the estimate (Eq. 26) 
for the local evidence associated with the mode Mi is replaced by 



3=1 



(Al) 



where the weights Wj for the inactive and active points are the same 
as those used in Eq. 26. Similarly, posterior inferences from the 
mode Ml are obtained by weighting every point (j = 1 to M) by 



The factors a 



Pj = C-jWja) > /Zi. 



(0 



are determined by modelling each mode Mi 
as a multivariate Gaussian with 'normalised amplitude' Ai, mean 
Hi and covariance matrix Ci, such that 



and calculating the initial values of each Zi using Eq. Al. In the 
M-step of the algorithm one then obtains the maximum-likelihood 
estimates of the parameters. These are easily derived (see e.g. 
Dempster et al. (1977)) to be 
ni 



Al = 



1 ^ 

-E 



a) 'Uj 



/Xi)(«j - Mi) 



(A6) 



(A7) 



(A8) 



where n; — 



(0 

1 "j ' 



Tld^i ri; and Uj = UjLjWj /Zi are 



the locally posterior-weighted sample points. In the subsequent E- 
step of the algorithm on then updates the values using Eq. A4 
and updates Zi using Eq. Al. We further impose the constraint that 
= if Uj ^ Ml and its likelihood £j is greater than the lowest 
likelihood of the points in Mi . The EM algorithm is then iterated 
to convergence. 



G{u;ni,&i) : 



V{u)ocY,AiG{u;iJ.i,Ci), (A2) 
Cj) is given by 

[-i(«-/x(rcr^(«-ixo] , 



where the Gaussian unit- volume G{u; ni 
1 



■exp 



(2,r)2|Ci| 

(A3) 

and the values of the parameters © = ({A;}, {/x;}, {C;}) are to 
be determined from the sample points {uj}. Since the scaling in 
Eq. A2 is arbitrary, it is convenient to set = 1. 

For a given set of parameter values 0, our required factors are 

a«(0) = Pr(M,|«,,©) 

Pr(Hj|Mi,0)Pr(Mi|Q) 
Ef=iPr(«,|M,,0)Pr(M,|0) 

AiG{u;ni,Ci) 
Eti^iG'(«;M!,CO 

Our approach is to optimise the parameters 0, and hence de- 
termine the factors a^l®), using an expectation-maximization 
(EM) algorithm. The jdgorithm is initialized by setting 



(A4) 



1 if Uj e Ml 
otherwise 



(A5) 



© 2008 RAS, MNRAS 000, 1-14 



